library(kableExtra)
library(tidyverse)
library(dplyr)
library(ggpubr)
library(lattice)
library(lme4)

Motivation

[LRJ: Once we finalize the case study, we should come back and revisit the intro/motivation section]

There are over 1400 public schools in Maryland and there are also outstanding students in each school. The federal Every Student Succeeds Act (ESSA) prompted states to develop long term plans to improve schools through accountability and innovation, which sets Maryland’s schools on the path to continuous improvement. Maryland Report Card website aims to share the most current information available to help stakeholders understand and measure student achievement in all 24 local school systems. Here is a message from the State Superintendent of Schools that enables you to know more about how it functions.

[source: https://www.youtube.com/watch?time_continue=8&v=g5ylQgdisTM]

Our analysis is inspired by this plan - figure out how well schools were performing and how different factors influenced their performance. Once we indentify schools that need improvements and influential factors, both Maryland’s government and local organizations can prompt actions and provide necessary support, in a way that is understanable and reliable.

The libraries used in this study are listed in the following table, along with their purpose in this particular case study:

Library Purpose
kableExtra Helps with building common complex tables and manipulating table styles; creates nice-looking HTML tables
tidyverse A coherent system of packages for data manipulation, exploration and visualization
dplyr Helps you solve the most common data manipulation challenges

In order to run this code please ensure you have these packages installed.

The learning objectives for this case study include:

What is the data?

The data files were downloaded from the Maryland Report Card website. Data files can be found by clicking Data Download at the bottom of the web page.

At this link, you can find data for all 1400+ schools in Maryland from 2003 to 2018. For this case study, we will focus on data on high schools from 2017. For this analysis, we will extract data from 5 of the files for that year:

Data Import

After downloading the 5 data files, we can write a script to import all the files in the same folder into R relatively easily. In this case, we’ve saved all files in a folder named data2017:

Knowing which environment you are in is necessary since it enables R to find these files quickly and accurately. Since our data files are in the folder data2017, we will need to specify this in the path to each file.

Instead of importing each file with an individual line of code, we can use the function lapply() to import them all simultaneously. To do this, first we make a vector of all of the names for the files we want to import. The list.files() function will make a vector of all files in a folder that match a given pattern. Here we want all .csv files, so we can use pattern="*.csv" to specify all files that end with a .csv. The * character means any set of characters can be in that location in the file name.

files = list.files(path='./data2017',pattern="*.csv")
files
## [1] "Cohort_Grad_Rate_2017.csv"         "PARCC_2017.csv"                   
## [3] "Special_Services_2017.csv"         "Student_Mobility_2017.csv"        
## [5] "Wealth_Expenditures_Data_2017.csv"

Next we create a vector of paths to these files by pasting the folder name, data2017, to the path for each file.

filePaths = paste0("./data2017/", files)
filePaths
## [1] "./data2017/Cohort_Grad_Rate_2017.csv"        
## [2] "./data2017/PARCC_2017.csv"                   
## [3] "./data2017/Special_Services_2017.csv"        
## [4] "./data2017/Student_Mobility_2017.csv"        
## [5] "./data2017/Wealth_Expenditures_Data_2017.csv"

Finally, we can read in each data file. The lapply() function will apply the same function repeatedly to each item in a vector and store the results in a list. The function we want to apply repeatedly is the read_csv() function; by giving the vector of file paths and this function to lapply(), we can ready in all 5 data files at once.

[LRJ: I think we want to use the read_csv() function here instead of read.csv() to stay within the tidyverse. However, when I make this change it changes some of the names of the variables, since it doesn’t replace spaces with dots. I do want to make this change, but then will need you to update the later code so that the names will work and the screenshots are accurate.]

data <- lapply(filePaths, read.csv, header=TRUE)
#data <- lapply(filePaths, read_csv)

Now we have a data object that consists of a list of data sets. We can access an individual data set with the [[]] operator.

#data
#data[[1]]
head(data[[1]])
##   Class.of.Year LEA LEA.Name School.Number
## 1          2017   1 Allegany           405
## 2          2017   1 Allegany           405
## 3          2017   1 Allegany           601
## 4          2017   1 Allegany           601
## 5          2017   1 Allegany           606
## 6          2017   1 Allegany           606
##                               School.Name                 Cohort Grad.Rate
## 1                          Fort Hill High 4-year adjusted cohort     82.86
## 2                          Fort Hill High 3-year adjusted cohort   <= 5.00
## 3 Center for Career & Technical Education 4-year adjusted cohort  >= 95.00
## 4 Center for Career & Technical Education 3-year adjusted cohort   <= 5.00
## 5                           Allegany High 4-year adjusted cohort     90.78
## 6                           Allegany High 3-year adjusted cohort   <= 5.00
##   Diplomas.Earned Adjusted.Cohort.Count Create.Date
## 1             145                   175    20180117
## 2               *                   163    20180117
## 3               *                   132    20180117
## 4               *                   137    20180117
## 5             128                   141    20180117
## 6               *                   133    20180117

Data Wrangling

Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham

[LRJ: is the quote from a book? then we need to cite the book and not just the author]

A large part of the case study involves data wrangling to get these five data files into a single data set that we can analyze. To do this, we will alternate between viewing the data and performing data wrangling steps. In this case-study, we will provide screenshots of the data; you can view the data in the same way by using the View() function on the data frame we are working with.

Graduation Rate

The first data file (Cohort_Grad_Rate_2017.csv) records the percentage of students in each school who received a Maryland high school diploma during 2017. Let’s look at the data frame that comes from this file:

View(data[[1]])

This data frame has a total of 579 rows with 10 variables. Looking at this data frame in R, we see there are several issues we will need to address, such as possible missing values designated by *. We will deal with these issues in later steps.

Feature Selection and Rename

Our first step is to subset to only school-level data. We notice that if School.Number equals ‘A’, then this row corresponds to county level data of all of the high-schools in the county. We also notice that there are two different graduation rates listed for different cohorts of students:

  • 4-year adjusted cohort graduation rate: the percentage of a school’s cohort of first-time 9th grade students who graduate within four years, adjusted for students who transfer in and out of the cohort after 9th grade.
  • 3-year adjusted cohort graduation rate: the percentage of a school’s cohort of first-time 9th grade students who graduate within three years, adjusted for students who transfer in and out of the cohort after 9th grade.

Since high school students typically graduate after four years, it makes more sense to use the 4-year adjusted cohort as our graduation rate variable. We also see this by looking at the values for these two rates – the 3-year rate is generally quite low while the 4-year rate is generally in the 80-100% range we would expect.

Function filter() and select() in package dplyr will help us to finish the first wrangling step.

df_grad <- data[[1]] %>%
 filter( School.Number != 'A', 
        Cohort == '4-year adjusted cohort') %>%
 select(LEA.Name, School.Number,
         School.Name, Grad.Rate)

colnames(df_grad) <- c('county', 'sch.num',
                       'sch', 'grad.rate')

See what we get! Only school level information and useful variables are kept, leaving us a neater dataset. Next, missing values should be fixed.

Missing Value

There are a lot of missing values in the dataset, which would influence further analysis if we leave them there. From the first plot, missing values exist in different forms:

  • '>= 95.00','<= 5.00' : The graduation rate is higher than 95% or lower than 5%.
  • '*' : This schools’ graduation rate is simply not present in the data.

Use function summary() to produce the descriptive statistics of these missing values.

summary(df_grad[2:4]) 
##     sch.num                                            sch     
##  301    :  4   Chesapeake High                           :  2  
##  102    :  3   Frederick Douglass High                   :  2  
##  207    :  3   Northwestern High                         :  2  
##  405    :  3   Aberdeen High                             :  1  
##  705    :  3   Academy for College and Career Exploration:  1  
##  108    :  2   Academy of Health Sciences at PGCC        :  1  
##  (Other):246   (Other)                                   :255  
##     grad.rate  
##  >= 95.00: 57  
##  *       : 27  
##  <= 5.00 :  9  
##  84.02   :  2  
##  84.4    :  2  
##  86.21   :  2  
##  (Other) :165

Pay attention to the first kind of missing value. It is impossible to calculate the exact graduate rate because of the lack of information, so we decide to use gsub() function to replace incomplete values. gsub() function replaces all matches of a string. Elements of string vectors which are not substituted will be returned unchanged.

Without knowledge of exact value, 2.5 and 97.5 are used to replace ‘<= 5.00’ and ‘>= 95.00’ respectively. By the way, graduation rate lower than 5% sounds abnormal - let’s look at these school before we replace them.

head(df_grad[df_grad$grad.rate == '<= 5.00', ])
##               county sch.num                                        sch
## 23  Baltimore County      54              Extended Day Learning Program
## 24  Baltimore County      58                 Home Assignments-Secondary
## 25  Baltimore County      69 Catonsville Center for Alternative Studies
## 26  Baltimore County      72                            Rosedale Center
## 28  Baltimore County      77                    BCDC Educational Center
## 157       Montgomery     916                        Rock Terrace School
##     grad.rate
## 23    <= 5.00
## 24    <= 5.00
## 25    <= 5.00
## 26    <= 5.00
## 28    <= 5.00
## 157   <= 5.00

Search Extended Day Learning Program and Home Assignments-Secondary, they provides special education and special needs school programs, which may be designed for adults - explaining the low graduation rate.

df_grad$grad.rate<- gsub('<= 5.00','2.5', df_grad$grad.rate)
df_grad$grad.rate <- gsub('>= 95.00','97.5', df_grad$grad.rate)

The second type of missing value (’*’) is confusing, let’s extract them out and have a look:

head(df_grad[df_grad$grad.rate == '*', ])
##              county sch.num                      sch grad.rate
## 6      Anne Arundel    1274       Marley Glen School         *
## 16     Anne Arundel    3414 Ruth Parker Eason School         *
## 21     Anne Arundel    4304   Central Special School         *
## 27 Baltimore County      75        Crossroads Center         *
## 29 Baltimore County     111     Maiden Choice School         *
## 41 Baltimore County     922             Ridge Ruxton         *

If we search Marley Glen School, we may found it mainly provides program for students with disabilities. Also, Ruth Parker Eason School provides a special education program for students with moderate to severe disabilities. So we guess schools with ’*‘graduation rate mainly provide education for students with moderate to severe disabilities. We can either delete it or replace them as ’NA’. Considering we need to merge multiple files in later steps, such kind of information may be dropped out automatically. Ths most appropriate method for us is replacing them as ‘NA’ and deal with them all together.

df_grad$grad.rate <- na_if(df_grad$grad.rate, '*')
df_grad$grad.rate <- as.numeric(df_grad$grad.rate)

Unique id

Our last step is to extract information from several files and merge them together, requiring a key that specifies each observation uniquely. Possible choices in present dataset can be sch.num or sch.name but we need to check whether each of them appear once in the dataset. If not, they are not suitable choice because of the lack of uniqueness.

summary(df_grad[,c('sch.num', 'sch')])
##     sch.num                                            sch     
##  301    :  4   Chesapeake High                           :  2  
##  102    :  3   Frederick Douglass High                   :  2  
##  207    :  3   Northwestern High                         :  2  
##  405    :  3   Aberdeen High                             :  1  
##  705    :  3   Academy for College and Career Exploration:  1  
##  108    :  2   Academy of Health Sciences at PGCC        :  1  
##  (Other):246   (Other)                                   :255

From the summary, whether sch.num or sch.name, the frequency of some values is larger than 1. Therefore it is definitely important to define a unique element by ourselves.

df_grad <- df_grad %>%
    within( id <- paste(county, sch, sep = '-'))

The combination of county and sch generates a new variable - id. To check its uniqueness, we use function table(), which builds a contingency table of the counts at each combination of factor levels. The as.data.frame() function converts the array-based representation of a contingency table to a data frame containing two variable: each factor Var1and its frequency Freq.

tab <- as.data.frame(table(df_grad$id))
tab[tab$Freq > 1,]
## [1] Var1 Freq
## <0 rows> (or 0-length row.names)

Up to now, a unique key element id is generated and we finally create the ideal dataset - a tidy dataset.

PARCC

Partnership for Assessment of Readiness for College and Careers (PARCC) reflects schools’ academic achievements through assessing the performance of students on state standardized tests, such as English and Math. It not only provides information about students mastery of state standards, but also offers teachers and parents with timely information to inform instruction and how to provide support. From this file, we want to extract the percentage of students scoring ‘high performance’, which works as the target variable is data analysis part.

colnames(data[[2]])
##  [1] "Academic.Year"                                   
##  [2] "LEA.Number"                                      
##  [3] "LEA.Name"                                        
##  [4] "School.Number"                                   
##  [5] "School.Name"                                     
##  [6] "Assessment"                                      
##  [7] "Tested.Count"                                    
##  [8] "Level.1..Did.not.yet.meet.expectations...Count"  
##  [9] "Level.1..Did.not.yet.meet.expectations...Percent"
## [10] "Level.2..Partially.met.expectations...Count"     
## [11] "Level.2..Partially.met.expectations...Percent"   
## [12] "Level.3..Approached.expectations...Count"        
## [13] "Level.3..Approached.expectations...Percent"      
## [14] "Level.4..Met.expectations...Count"               
## [15] "Level.4..Met.expectations...Percent"             
## [16] "Level.5..Exceeded.expectations...Count"          
## [17] "Level.5..Exceeded.expectations...Percent"        
## [18] "Create.Date"

PARCC is a large file with 8989 observations and 18 variables. We notice there are five levels of performance indicators, ranging from Level 1 to Level 5. First, we use percentage information rather than count information. Second, we will create a new variable, pro, that indicates the percentage of students performing at the ‘met expectations’ and ‘exceeded expectations’ levels.

Feature Selection and Rename

Similiar to graduation rate file, only school level information and necessary variables (LEA.Name, School.Number, School.Name, Assessment and five level indicators) will be kept and renamed.

df_parcc <- data[[2]] %>%
  filter( School.Number != 'A' ) %>%
  select(3,4,5,6,9,11,13,15,17)

colnames(df_parcc) <- c('county','sch.num', 
                        'sch', 'subject','L1',
                        'L2','L3','L4','L5')

Variable subject reflects the kind of subject test. English/Language Arts Grade 10, Algebra 1 and 2 are filtered out by function filter() in package dplyr. You are encouraged to keep other assessments that you are interested in.

df_parcc <- df_parcc %>%
  filter(subject %in% c('English/Language Arts Grade 10','Algebra 1','Algebra 2'))

Missing Value

Missing value is the toughest problem in this wrangling part. Let’s see its summary first.

summary(df_parcc[, 5:9])
##        L1            L2            L3            L4            L5     
##  <= 5.0 :273   <= 5.0 :176   <= 5.0 :127   <= 5.0 :151   <= 5.0 :597  
##  17.6   :  7   28.6   :  9   33.3   : 10   33.3   :  7   6.8    :  7  
##  5.5    :  7   15.2   :  8   22.7   :  8   75     :  7   19.8   :  5  
##  15.8   :  6   16.7   :  7   25     :  8   11.4   :  5   5.6    :  5  
##  5.6    :  6   17.9   :  7   14.3   :  7   17.6   :  5   5.1    :  4  
##  6.7    :  6   18.2   :  7   19.7   :  7   34.1   :  5   8.6    :  4  
##  (Other):569   (Other):660   (Other):707   (Other):694   (Other):252

There is only one kind of missing value: <=5.0 in these 5 levels indicators. We will replace it as NA, then transform them to numerical data.

df_parcc[,5:9] <- 
  lapply(df_parcc[,5:9], function(x) as.numeric(as.character(x)))

We only care about the proportion of L4 and L5 but NAs are distributed differently. To simply wrangling steps, we reduce these five levels into two new levels : ‘Level weak’ (L1, L2 and L3) and ‘Level excellent’ (L4 and L5). Then deal with NAs in three ways. If values of ‘Level excellent’ are complete, then pro equals to sum of L4 and L5. Such as Algebra 1 assessment of ‘Washington Middle’, its pro should be \(85.5 + 9.7 = 95.2\); If values of ‘Level weak’ all exist, pro equals to 100 percent minus the sum of L1, L2 and L3. For example, the pro of Algebra 1 for ‘Fort Hill High’ equals to \(100-23.9-44.6-21.7 = 9.8\);

df_parcc$pro <- NA

indx1 <- !is.na(df_parcc$L4) &  !is.na(df_parcc$L5)
df_parcc[indx1, 'pro'] <- rowSums(df_parcc[indx1,8:9])

sum(is.na(df_parcc$pro))
## [1] 597
indx2 <- !is.na(df_parcc$L1) &  !is.na(df_parcc$L2) & !is.na(df_parcc$L3)

df_parcc[indx2, 'pro'] <-  100-rowSums(df_parcc[indx2,5:7])

sum(is.na(df_parcc$pro)) 
## [1] 203

Function is.na() indicates which elements are missing and returns a boolean index of the same shape as the original data frame. df_parcc[indx1,] and df_parcc[indx2,] specify observations have complete information of ‘Level excellent’ and ‘Level weak’ respectively. sum(is.na(df_parcc$pro)) shows the counts of NA in variable pro. Undoubtedly, our methods reduce NAs greatly after we apply corresponding methods mentioned above.

If missing values exist both in these two levels, we will first use 100 percent minus existing values. Then divide the difference to the number of missing value and replace NA with this quotient. For instance - ‘Brooklyn Park Middle’, there are three NAs, so \(NA = (100-20.8-72.7)/3=2.2\), and pro equals to \(72.7+2.2=74.9\). The reason we don’t replace these NAs as 2.5 directly is that there is a restriction we don’t want to obey - the sum of 5 levels equals to 100 percent. It is more reasonable to assume missing values are uniformly distributed than replace them with a fixed value directly.

indx3 <- is.na(df_parcc$pro)


na_sum <- 100-rowSums(df_parcc[indx3, 5:9], na.rm = TRUE)
n <- rowSums(is.na(df_parcc[indx3, 5:9]))
na <- round(na_sum/n,1)

for (i in 1:length(indx3)){
  df_parcc[indx3, 5:9][i,which(is.na(df_parcc[indx3, 5:9][i,]))] <- na[i]
}

df_parcc[indx3, 'pro'] <- rowSums(df_parcc[indx3,8:9])
sum(is.na(df_parcc$pro)) 
## [1] 0

What the above script does? First, rows containing NA in pro are filtered out and recorded as indx3, totally 874 observations. Then, for each row, we calculate the difference of 100 percent and sum of existing values - na_sum, and the number of missing value - n. Dividing na_sum by n gives the quotients na. All of these three variables are vectors which length are the same as indx3 - 203. The last step is replacement and we write a for loop here. For each row (df_parcc[indx3, 5:9][i,]), columns containing NA are selected out with the help of function is.na(). Applying function which() enables us to get the corresponding columns indexs. Once we determine the positions of NA, we can assign what we calculated before - na to these NAs.

Now, we have 0 NAs, which proves the validity and feasibility of our methods. The following step is to create a unique id - still use the combination of county name and school name. And we only keep new variable pro in the final dataset

pac <- df_parcc %>%
  within( id <- paste(county, sch, sep = '-')) %>%
  select(county, sch.num, sch, subject, pro, id)

This is how the final parcc dataset looks like:

What’s more, the last two rows - The Seed School of Maryland, lacks information about which county it belongs to.

tail(pac)
##             county sch.num                              sch
## 869 Baltimore City     454 Carver Vocational-Technical High
## 870 Baltimore City     480           Baltimore City College
## 871 Baltimore City     480           Baltimore City College
## 872 Baltimore City     480           Baltimore City College
## 873           SEED    1000      The Seed School of Maryland
## 874           SEED    1000      The Seed School of Maryland
##                            subject  pro
## 869                      Algebra 2  1.0
## 870 English/Language Arts Grade 10 45.1
## 871                      Algebra 1 27.7
## 872                      Algebra 2 16.7
## 873 English/Language Arts Grade 10 38.4
## 874                      Algebra 1  9.8
##                                                  id
## 869 Baltimore City-Carver Vocational-Technical High
## 870           Baltimore City-Baltimore City College
## 871           Baltimore City-Baltimore City College
## 872           Baltimore City-Baltimore City College
## 873                SEED-The Seed School of Maryland
## 874                SEED-The Seed School of Maryland

After check online, we find it is not a traditional high school but a boarding school that draws students from all over the state of Maryland. So it wouldn’t be right to assign them to any particular county. We choose to exclude them from our analysis.

pac <- pac %>%
  filter( county != 'SEED')

Extract Assessment Information

The proportion of high performance in these assessment (\(p\)) plays an role of response variable in our later analysis. Thus, splicting them out first makes further analysis convenient.

\[ p_{ela} = \beta_0+\beta_{1} x_1+ \beta_{2} x_2 \]

df_ela <- pac[grep("English/Language Arts Grade 10", pac$subject), ]
df_alg1 <- pac[grep("Algebra 1", pac$subject), ]
df_alg2 <- pac[grep("Algebra 2", pac$subject), ]
dim(df_ela)  #size of df_ela
## [1] 233   6
dim(df_alg1) #size of df_alg1
## [1] 472   6
dim(df_alg2) #size of df_alg2
## [1] 167   6

It’s wired! The size of df_alg1 is larger than the other two so let’s look at it:

head(df_alg1)
##      county sch.num                                     sch   subject  pro
## 2  Allegany     405                          Fort Hill High Algebra 1  9.8
## 4  Allegany     406                       Washington Middle Algebra 1 95.2
## 5  Allegany     504                         Braddock Middle Algebra 1 84.4
## 6  Allegany     601 Center for Career & Technical Education Algebra 1  7.6
## 9  Allegany     606                           Allegany High Algebra 1  8.4
## 11 Allegany     802                          Westmar Middle Algebra 1 87.9
##                                                  id
## 2                           Allegany-Fort Hill High
## 4                        Allegany-Washington Middle
## 5                          Allegany-Braddock Middle
## 6  Allegany-Center for Career & Technical Education
## 9                            Allegany-Allegany High
## 11                          Allegany-Westmar Middle

It is because it contains middle school information! Actually for Algebra 1, students will take the exam the year they learn it so that’s why both middle school and high school have this test. Besides, pro of middle schools are much higher than that of high schools. It is because students who take Algebra1 test in middle schoos are more likely to be good at this subject. This fact makes them incompariable with students who take Algebra 1 test in high schools. pro for high school can only reflect part of the academic achievements - those who may not be good at Algebra 1. Pay attention to this point in later analysis.

How to extract the high school information? Here variable id comes to work! df_grad only includes information for high schools. So if obseravations in df_alg1 also exist in df_grad, it corresponds to high school, otherwise to middle school. The unique variable id enables us to check the co-existness with the help of function filter().

df_alg1 <- df_alg1 %>%
  filter(id %in% df_grad$id)

dim(df_alg1)
## [1] 226   6

Special Services

Special_Services_2017.csv records the number and percentage of students who applied different special service and students approved through direct certification. Such variables make us know more about the school quality and students support. Here we choose service FARMS: receive free or reduced price meals.

What the following script does is similiar to previous wrangling steps except one step. Since in special service file there is a vairable called School.Type, we can filter high school information easily if we add a new condition School.Type == 'High'.

df_spc <- data[[3]] %>%
  filter( School.Number != 'A', School.Type == 'High' ) %>%
  within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
  select(LEA.Name, School.Number, School.Name, 
           FARMS.Pct, id)

colnames(df_spc) <- c('county', 'sch.num', 'sch', 'farms', 'id')

summary(df_spc)
##               county      sch.num   
##  Baltimore City  :41   0301   :  4  
##  Prince George's :37   0102   :  3  
##  Baltimore County:33   0207   :  3  
##  Montgomery      :31   0405   :  3  
##  Anne Arundel    :18   0705   :  3  
##  Howard          :14   0108   :  2  
##  (Other)         :94   (Other):250  
##                                          sch          farms    
##  Chesapeake High                           :  2   *      : 12  
##  Frederick Douglass High                   :  2   <= 5.0 :  7  
##  Northwestern High                         :  2   35.1   :  3  
##  Aberdeen High                             :  1   55.4   :  3  
##  Academy for College and Career Exploration:  1   12.5   :  2  
##  Academy of Health Sciences at PGCC        :  1   13.8   :  2  
##  (Other)                                   :259   (Other):239  
##       id           
##  Length:268        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

There is only one problem in this dataset, also it is an old problem - missing value.

Missing Value

Similiarly, let’s have a look on what these schools with ’*’ in farms look like:

df_spc[df_spc$farms == '*',]
##               county sch.num                              sch farms
## 6       Anne Arundel    1274               Marley Glen School     *
## 10      Anne Arundel    2233        Anne Arundel Evening High     *
## 28  Baltimore County    0077          BCDC Educational Center     *
## 57           Calvert    0206           Calvert Country School     *
## 70           Carroll    0712           Carroll Springs School     *
## 72           Carroll    0717           Post Secondary Program     *
## 123           Howard    0522        Cedar Lane Special Center     *
## 156       Montgomery    0799            Stephen Knolls School     *
## 159       Montgomery    0951                  Longview School     *
## 196  Prince George's    2217 Incarcerated Youth Center (JACS)     *
## 220         Wicomico    0520     Wicomico County Evening High     *
## 267   Baltimore City    0884             Eager Street Academy     *
##                                                   id
## 6                    Anne Arundel-Marley Glen School
## 10            Anne Arundel-Anne Arundel Evening High
## 28          Baltimore County-BCDC Educational Center
## 57                    Calvert-Calvert Country School
## 70                    Carroll-Carroll Springs School
## 72                    Carroll-Post Secondary Program
## 123                 Howard-Cedar Lane Special Center
## 156                 Montgomery-Stephen Knolls School
## 159                       Montgomery-Longview School
## 196 Prince George's-Incarcerated Youth Center (JACS)
## 220            Wicomico-Wicomico County Evening High
## 267              Baltimore City-Eager Street Academy

It seems that they are still not ‘normal’ public schools so we change ’*’ to NA.

After replacing ‘<= 5.0’ to ‘2.5’ with function gsub(), if you apply as.numeric() function, ’*’ will be transformed to NA automatically.

df_spc$farms <- gsub('<= 5.0','2.5', df_spc$farms)
df_spc$farms <- as.numeric(df_spc$farms)

This is how the final special serivce dataset looks like:

Student Mobility

Student_Mobility_2017.csv includes information of the movement of students from one school to another during the school year. There are 3 types of mobility provided: total student mobility, entry mobility, and exit mobility. To some extent, exit mobility reflects students’ satisfaction with their schools so we would like to use it as variable withdraw in our analysis.

As usual, wrangling steps are high school observation selection, unique id creation, feature selection and missing value transformation.

df_mob <- data[[4]] %>%
  filter( School.Number != 'A', School.Type == 'High' )%>%
  within( id <- paste(LEA.Name, School.Name, sep = '-')) %>%
  select(LEA.Name, School.Number, 
           School.Name, Withdrawals.Pct, id)

colnames(df_mob) <- c('county', 'sch.num', 'sch', 'withdraw', 'id')

summary(df_mob)
##               county      sch.num   
##  Baltimore City  :41   301    :  4  
##  Prince George's :37   102    :  3  
##  Baltimore County:33   207    :  3  
##  Montgomery      :31   405    :  3  
##  Anne Arundel    :18   705    :  3  
##  Howard          :14   108    :  2  
##  (Other)         :94   (Other):250  
##                                          sch         withdraw  
##  Chesapeake High                           :  2   <= 5.0 : 77  
##  Frederick Douglass High                   :  2   >= 95.0:  9  
##  Northwestern High                         :  2   5.5    :  8  
##  Aberdeen High                             :  1   8.6    :  5  
##  Academy for College and Career Exploration:  1   10.3   :  3  
##  Academy of Health Sciences at PGCC        :  1   10.9   :  3  
##  (Other)                                   :259   (Other):163  
##       id           
##  Length:268        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
df_mob$withdraw <- gsub('<= 5.0','2.5', df_mob$withdraw)
df_mob$withdraw <- gsub('>= 95.0','97.5', df_mob$withdraw)
df_mob$withdraw <- as.numeric(df_mob$withdraw)

This is how the final mobility dataset looks like:

Wealth Expenditures

Wealth_Expenditures_Data_2017.csv contains information of different kinds of investments as well as wealth per pupil. What makes it special is that its size (11 variables and 25 observations) is obsivously smaller than previous files. Let’s have a look on it:

It turns out that this file only contains county level data! The last row is state level ‘All Public School’ which needs to will be deleted when we merge datasets. Then, we would like to keep variable county which will work as the key in later merge steps. Besides, a new variable exp will be created by the formula:

\[exp = \frac{Wealth.Per.Pupil}{Expenditures.Per.Pupil}\]

The quotient of Wealth.Per.Pupil and Expenditures.Per.Pupil captures the image of each county’s financial condition. Function mutate() in package dplyr is a nice tool to adds new variables. Thus, wrangling steps are new variable creation, feature selection and rename.

df_exp <- data[[5]] %>%
  mutate(exp = round(Wealth.Per.Pupil/Expenditures.Per.Pupil, 1)) %>%
  select(LEA.Name, exp)

colnames(df_exp) <- c('county','exp')

We notice the last row is information about ‘All Public School’, but it will be dropped automatically when we merge it.

Final Dataset

Relational Datasets

Now we have totally 8 datasets, requiring combination to answer the questions that we are interested in. They are called relational data because it is their relations, not just the individual datasets, that are important. Relations are defined between a pair in these datasets. In out case, relations presented in the form that the same variable exists in multiple datasets. The following plot describes how our datasets connect:

We can see that variable county exists in all datasets, variable sch, sch.name and id exists in all datasets except df_exp. In addition, each dataset contain its unique variables.

To work with relational data, mutating joins is pretty powerful which combines variables from two datasets. It first matches observations by their keys, then copies across variables from one dataset to the other. keys is a variable that uniquely indentifies an observation so it can connect each pair of datasets. This is the meaning of the existence of variable id - sufficiently indentifies each observation in our datasets.

mutating joins includes inner joins and outer joins. An inner join keeps observations that appear in both datasets while outer join keeps observations that appear in at least one of the datasets. There are three types of outer joins: left join, right join and full join. The graphical explanation is given below:

Inner join:

Outer join:

To be more specific, a table is provided to show detailed descriptions.

Types How it works Implementation in dplyr
inner join keeps observations that appear in both datasets inner_join()
left join keeps all observations in x but drops bbservations in y but not in x left_join()
right join keeps all observations in y but drops bbservations in x but not in y right_join()
full join keeps all observations in x and y full_join()

Left join is the most common and popular method, especially when we only need part of variables from another dataset. It preserves the original observations even when there isn’t a match - just make it as NA. This is also our best choice and we will show how it works in our case.

Example

The final dataset should include predict variables grad.rate, withdraw, farms, exp and target variable pro. Three datasets with this structure will be created for df_ela, df_alg1 and df_alg2 respectively. As mentioned, we would focus on left join with the help of function left_join(). Let’s see an example for df_ela first:

temp <- df_ela %>%
   select(id, county, pro) %>%
  left_join(df_grad[, c('grad.rate', 'id')], by = 'id') 

head(temp)
##                                    id       county  pro grad.rate
## 1             Allegany-Fort Hill High     Allegany 40.9     82.86
## 2              Allegany-Allegany High     Allegany 52.9     90.78
## 3 Allegany-Mountain Ridge High School     Allegany 54.2     87.82
## 4       Anne Arundel-Glen Burnie High Anne Arundel 37.9     90.30
## 5      Anne Arundel-North County High Anne Arundel 49.5     86.18
## 6      Anne Arundel-Severna Park High Anne Arundel 83.1     97.50
tail(temp)
##                                                               id
## 228 Baltimore City-Augusta Fells Savage Institute of Visual Arts
## 229                                Baltimore City-Coppin Academy
## 230                           Baltimore City-Renaissance Academy
## 231                       Baltimore City-Frederick Douglass High
## 232              Baltimore City-Carver Vocational-Technical High
## 233                        Baltimore City-Baltimore City College
##             county  pro grad.rate
## 228 Baltimore City  2.4     54.20
## 229 Baltimore City  9.3     91.57
## 230 Baltimore City  3.2     52.00
## 231 Baltimore City  3.2     70.10
## 232 Baltimore City  2.8     76.17
## 233 Baltimore City 45.1     97.50

Look! grad.rate in df_grad was added to df_ela successfully. Option by= enables you to specify the keys you would like to use. Then, we repeat this merge step for df_mob, df_spc and df_exp.

Dataset Variable Kept Keys
df_grad grad.rate: 4-year graduation rate id
df_mob withdraw: students’ exit mobility id
df_spc farms: percentage of students reciving free/reduced price meals id
df_exp exp: quotient of wealth and expenditures county
temp1 <- df_ela %>%
  select(id, county, pro) %>%
  left_join(df_grad[, c('grad.rate', 'id')], by = 'id') %>%
  left_join( df_mob[, c('withdraw', 'id')], by = 'id') %>%
  left_join( df_spc[, c('farms', 'id')], by = 'id') %>%
  left_join(df_exp, by = 'county')

Instead of repeating by ourselves, the R base package provides a function Reduce(), which can come in handy. Reduce() takes a function \(f\) of two arguments and a list or vector \(x\) which is to be reduced using \(f\). The function is first called on the first two components of \(x\), then with the result of that as the first argument and the third component of \(x\) as the second argument, then again with the result of the second step as first argument and the fourth component of \(x\) as the second argument etc. The process is continued until all elements of \(x\) have been processed.

L_ela <- list(df_ela[, c('id', 'county', 'pro')], 
              df_grad[, c('id','grad.rate')], 
              df_mob[, c('id','withdraw')], 
              df_spc[, c('id','farms')], df_exp)

temp2 <- Reduce(function(x,y) left_join(x, y, 
                                        by = colnames(y)[1]), L_ela)
# `by = colnames(y)[1])` tells how to find the *keys* in each left_join. 

The above script shows how Reduce() function works in out case. First, these five files are save in list L_ela. Then function left_join() was applied one by one in list L_ela with function Reduce(). temp2 is the same dataset as temp1. So, let’s write it as a function and apply it to df_alg1 and df_alg2.

Apply Function

func_merge <- function(data){
L <- list(data[, c('id', 'county', 'pro')], 
              df_grad[, c('id','grad.rate')], 
              df_mob[, c('id','withdraw')], 
              df_spc[, c('id','farms')], df_exp)
  
  final_data<- Reduce(function(x,y) left_join(x, y, 
                                        by = colnames(y)[1]), L)
  return(final_data)
}

df_ELA <- func_merge(df_ela)
df_ALG1 <- func_merge(df_alg1)
df_ALG2 <- func_merge(df_alg2)

Wonderful! Finally we get the ideal datasets: df_ELA, df_ALG1 and df_ALG2.

Exploratory data analysis

# Boxplot
p1 <- ggplot(df_ALG1, aes(x = reorder(county, pro), 
  y = pro,fill = county)) + geom_boxplot()+
  guides(fill = FALSE) + theme_bw() + 
    theme(plot.title = element_text(hjust = 0.5 ), 
          axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5)) + 
  labs( x = 'County', y = 'Distribution', 
        caption = 'proportion of high performance for Algebra 1 in year 2017')

ALG1_sum <- df_ALG1 %>% 
  group_by(county) %>% 
  summarise(mean = mean(pro),
            sd = sd(pro))
# Line plot
p2 <- ggplot(ALG1_sum, aes( reorder(factor(county), mean), mean, group = 1)) +
  geom_line(alpha = 1/3) +
  geom_pointrange(aes(ymax = mean + sd, ymin = mean - sd)) +
  labs(x = 'County', caption = 'Error bars') +
  theme_minimal() + 
    theme(plot.title = element_text(hjust = 0.5 ), 
          axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))

ggarrange(p1, p2, ncol = 2)

distplot <- function(data, variable){
  p <- ggplot(data, aes(x = reorder(county, variable), 
  y = variable, fill = county)) + geom_boxplot()+
  guides(fill = FALSE) + theme_bw() + 
    theme(plot.title = element_text(hjust = 0.5 ), 
          axis.text.x = element_text(angle = 90, hjust = 1,vjust = 0.5))+
  xlab("County") + ylab("Distribution") 
  
  
  return(p)
}

p3 <- distplot(df_ALG1, df_ALG1$pro)
p4 <- distplot(df_ALG2, df_ALG2$pro)
p5 <- distplot(df_ELA, df_ELA$pro)

ggarrange(p3, p4, p5, nrow = 2, ncol = 2)

p6 <- ggplot(data = df_ALG1, aes(x = grad.rate, y = pro))+    
  geom_point()+ 
  geom_smooth(method = "lm", se = FALSE)+
  theme(legend.position = "none") 

p7 <- ggplot(data = df_ALG1, aes(x = grad.rate, y = pro,group = county))+    
  facet_grid( ~ county, switch = 'x')+
  geom_point(aes(colour = county))+ 
  geom_smooth(method = "lm", se = FALSE, aes(colour = county))+
  theme(strip.text=element_text(angle = 90))+
  theme(legend.position = "none") +
  theme(axis.text = element_blank()) +
  theme(axis.ticks = element_blank()) 

ggarrange(p6, p7, nrow = 2)

#xyplot(pro~grad.rate |county, df_ALG1,
#        type = c("p", "smooth"))

Data analysis

To evaluate influence of each predictor variable, linear regression may first comes to your mind. Let’s fit a simple linear regression model here:

m <- lm(pro ~ grad.rate + withdraw + farms + exp, data = df_ELA)
summary(m)
## 
## Call:
## lm(formula = pro ~ grad.rate + withdraw + farms + exp, data = df_ELA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.674  -7.764  -0.321   6.993  55.423 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.26712   12.14055   3.729 0.000245 ***
## grad.rate    0.19937    0.11298   1.765 0.079011 .  
## withdraw    -0.13216    0.11418  -1.157 0.248336    
## farms       -0.82627    0.05685 -14.533  < 2e-16 ***
## exp          0.50565    0.08758   5.774 2.62e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.62 on 220 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7731, Adjusted R-squared:  0.7689 
## F-statistic: 187.4 on 4 and 220 DF,  p-value: < 2.2e-16

Nevertheless, what makes our dataset different is its nested data structure: units are clustered into subgroups that are also nested within larger groups. We use a plot to indicate our dataset’s structure:

As shown above, the dataset can be devided into two levels: county level (L1) and school level (L2). Such structure is common in human and biological science. For instance, in educational settings, students are clustered into schools and schools are clustered nto districts, which are part of a larger system (county, state or country). The subgroup-specific characteristics may have some potential influence on outcomes. Another typical example can be longtitude data, a variable’s values over time are correlated with each other. In this case, linear regression fail to capture the influence of clustering so we need a more nuanced analysis.

Multilevel models (MLM) are statistical models of parameters that vary at more than one level. To be specific, a two-level model allows for variation in both school level and county level, which cannot be accounted for in traditional ordinary least squares regression. County level variation, also called ‘county effects’, represents a source of dependence that influence schools’ performance. MLM always includes fixed and random parameters. Fixed parameters are composed of a constant over all the groups, whereas a random parameter has a different value for each of the groups.

This section will go over how to run MLM for our dataset in R, and focus on estimating fixed and random parameters as well as how to interpret the output. This section will build fairly simple and slightly more complex models, as the primary focus of this section is on proper model building and interpretation rather than an exhaustive review of highly complex MLM models. This section will also include a short review of different library options for HLM analyses.

Simple model

First, we will begin with an unconditional model, which will provide a general basis for more sophisticated models to follow. This unconditional model asks the question, ‘Is there variability in Algebra 1 achievement across county?’

m1 <- lmer (pro ~ 1 + 1|county, data=df_ALG1)

Function lmer() in package lme4 is a powerful tool to fit MLM. Package lme4 will allow you to analyze data using restricted maximum likelihood estimation (REML) rather than ordinary least squares (OLS), making it ideal for MLM analyses.

Multilevel model